Goto

Collaborating Authors

 query time


Practical Near Neighbor Search via Group Testing: Supplementary Materials

Neural Information Processing Systems

In this section, we provide proofs for all of the theorems introduced in the main text. We begin with a simple extension of the results of [3] for the Bloom filter false positive and negative rates. Then, we prove our main claim, which is that the query time of our data structure is sublinear, given some relatively weak assumptions on the stability of the query. Theorem 1. Assuming the existence of an LSH family with collision probability s(x,y) = sim(x,y), the distance-sensitive Bloom filter solves the approximate membership query problem with p 1 exp 2m t/m+ SLH We begin with a brief explanation of the results from [3]. Recall that a distance-sensitive Bloom filter is a collection of mbit arrays. Array iis indexed using an independent LSH function li(x). To insert a point xinto the ith array, we set the bit at location li(x) to '1.' To query the filter, we calculate the mhash values of the query and return "true" when at least tof the corresponding bits are '1.' To bound p (the true positive rate) and q (the false positive rate), we bound the probability that a single array returns "true."


Practical Near Neighbor Search via Group Testing

Neural Information Processing Systems

We present a new algorithm for the approximate near neighbor problem that combines classical ideas from group testing with locality-sensitive hashing (LSH). We reduce the near neighbor search problem to a group testing problem by designating neighbors as "positives," non-neighbors as "negatives," and approximate membership queries as group tests.


Faster Query Times for Fully Dynamic k-Center Clustering with Outliers

Neural Information Processing Systems

Given a point set P M from a metric space (M,d)and numbers k,z N, the metric k-center problem with z outliers is to find a set C P of k points such that the maximum distance of all but at most z outlier points of P to their nearest center in C is minimized. We consider this problem in the fully dynamic model, i.e., under insertions and deletions of points, for the case that the metric space has a bounded doubling dimension dim. We utilize a hierarchical data structure to maintain the points and their neighborhoods, which enables us to efficiently find the clusters. In particular, our data structure can be queried at any time to generate a (3 + ฮต)-approximate solution for input values of k and z in worst-case query time ฮต O(dim)klognloglog, where is the ratio between the maximum and minimum distance between two points in P. Moreover, it allows insertion/deletion of a point in worst-case update time ฮต O(dim) lognlog . Our result achieves a significantly faster query time with respect to k and z than the current state-of-theart by Pellizzoni, Pietracaprina, and Pucci [18], which uses ฮต O(dim)(k+z)2 log query time to obtain a (3+ฮต)-approximate solution.


Statistical-Computational Trade-offs for Density Estimation

Neural Information Processing Systems

We study the density estimation problem defined as follows: given $k$ distributions $p_1, \ldots, p_k$ over a discrete domain $[n]$, as well as a collection of samples chosen from a query distribution $q$ over $[n]$, output $p_i$ that is close to $q$. Recently Aamand et al. gave the first and only known result that achieves sublinear bounds in both the sampling complexity and the query time while preserving polynomial data structure space. However, their improvement over linear samples and time is only by subpolynomial factors.Our main result is a lower bound showing that, for a broad class of data structures, their bounds cannot be significantly improved. In particular, if an algorithm uses $O(n/\log^c k)$ samples for some constant $c> 0$ and polynomial space, then the query time of the data structure must be at least $k^{1-O(1)/\log \log k}$, i.e., close to linear in the number of distributions $k$. This is a novel statistical-computational trade-off for density estimation, demonstrating that any data structure must use close to a linear number of samples or take close to linear query time. The lower bound holds even in the realizable case where $q=p_i$ for some $i$, and when the distributions are flat (specifically, all distributions are uniform over half of the domain $[n]$). We also give a simple data structure for our lower bound instance with asymptotically matching upper bounds. Experiments show that the data structure is quite efficient in practice.


Statistical-ComputationalTrade-offsforDensity Estimation

Neural Information Processing Systems

Inparticular,ifanalgorithm uses O(n/logck) samples for some constantc > 0 and polynomial space, then the query time of the data structure must be at leastk1 O(1)/loglogk, i.e., close to



Space and Time Efficient Kernel Density Estimation in High Dimensions

Neural Information Processing Systems

However, their data structure requires a significantly increased super-linear storage space, as well as super-linear preprocessing time. These limitations inhibit the practical applicability of their approach on large datasets.


GraphReorderingforCache-EfficientNearNeighbor Search

Neural Information Processing Systems

Graph search is one of the most successful algorithmic trends in near neighbor search. Severalofthemostpopular andempirically successful algorithms are,at their core, a greedy walk along a pruned near neighbor graph.


NearNeighbor

Neural Information Processing Systems

We show that LSH based algorithms can be made fair, without a significant loss in efficiency. Specifically, we show an algorithm that reports a point in the rneighborhood of a query q with almost uniform probability.